Consumer Reports (1995, November) published an article
which concluded that patients benefited very substantially
from psychotherapy, that long-term treatment did considerably
better than short-term treatment, and that psychotherapy
alone did not differ in effectiveness from medication
plus psychotherapy. Furthermore, no specific modality of
psychotherapy did better than any other for any disorder;
psychologists, psychiatrists, and social workers did not
differ in their effectiveness as treaters; and all did better
than marriage counselors and long-term family doctoring.
Patients whose length of therapy or choice of therapist was
limited by insurance or managed care did worse. The
methodological virtues and drawbacks of this large-scale
survey are examined and contrasted with the more traditional
efficacy study, in which patients are randomized into a
manualized, fixed duration treatment or into control groups.
I conclude that the Consumer Reports survey complements
the efficacy method, and that the best features of these two
methods can be combined into a more ideal method that
will best provide empirical validation of psychotherapy.


How do we find out whether psychotherapy works?
To answer this, two methods have arisen: the efficacy
study and the effectiveness study. An efficacy
study is the more popular method. It contrasts some kind of
therapy to a comparison group under well-controlled conditions.
But there is much more to an efficacy study than just a
control group, and such studies have become a high-paradigm
endeavor with sophisticated methodology. In the ideal
efficacy study, all of the following niceties are found:

1. The patients are randomly assigned to treatment and 
control conditions. 

2. The controls are rigorous: Not only are patients
included who receive no treatment at all, but placebos
containing potentially therapeutic ingredients credible to both
the patient and the therapist are used in order to control for
such influences as rapport, expectation of gain, and sympathetic
attention (dubbed nonspecifics).

3. The treatments are manualized, with highly detailed 
scripting of therapy made explicit. Fidelity to the manual 
is assessed using videotaped sessions, and wayward 
implementers are corrected. 

4. Patients are seen for a fixed number of sessions.

5. The target outcomes are well operationalized (e.g.,
clinician-diagnosed DSM-IV disorder, number of reported
orgasms, self-reports of panic attacks, percentage of fluent
utterances).

6. Raters and diagnosticians are blind to which group 
the patient comes from. (Contrary to the “double-blind” 
method of drug studies, efficacy studies of psychotherapy 
can be at most “single-blind,” since the patient and therapist 
both know what the treatment is. Whenever you hear someone
demanding the double-blind study of psychotherapy,
hold onto your wallet.) 

7. The patients meet criteria for a single diagnosed 
disorder, and patients with multiple disorders are typically 
excluded. 

8. The patients are followed for a fixed period after 
termination of treatment with a thorough assessment 
battery. 

So when an efficacy study demonstrates a difference
between a form of psychotherapy and controls, academic
clinicians and researchers take this modality seriously indeed.
In spite of how expensive and time-consuming they
are, hundreds of efficacy studies of both psychotherapy and
drugs now exist, many of them well done. These studies
show, among many other things, that cognitive therapy,
interpersonal therapy, and medications all provide moderate
relief from unipolar depressive disorder; that exposure and
clomipramine both relieve the symptoms of obsessive-compulsive
disorder moderately well but that exposure has more
lasting benefits; that cognitive therapy works very well in
panic disorder; that systematic desensitization relieves specific
phobias; that “applied tension” virtually cures blood
and injury phobia; that transcendental meditation relieves
anxiety; that aversion therapy produces only marginal improvement
with sexual offenders; that disulfiram (Antabuse)
does not provide lasting relief from alcoholism; that flooding
plus medication does better in the treatment of agoraphobia
than either alone; and that cognitive therapy provides significant
relief of bulimia, outperforming medications alone
(see Seligman, 1994, for a review).

The high praise “empirically validated” is now virtually 
synonymous with positive results in efficacy studies, and 
many investigators have come to think that an efficacy study 
is the “gold standard” for measuring whether a treatment 
works. 

I also had come to that opinion when I wrote What You
Can Change & What You Can't (Seligman, 1994). In trying
to summarize what was known about the effects of the panoply
of drugs and psychotherapies for each major disorder, I
read hundreds of efficacy studies and came to appreciate the
genre. At minimum I was convinced that an efficacy study
may be the best scientific instrument for telling us whether a
novel treatment is likely to work on a given disorder when
the treatment is exported from controlled conditions into the
field. Because treatment in efficacy studies is delivered under
tightly controlled conditions to carefully screened patients,
sensitivity is maximized and efficacy studies are very
useful for deciding whether one treatment is better than
another treatment for a given disorder.

But my belief has changed about what counts as a
“gold standard.” And it was a study by Consumer Reports
(1995, November) that singlehandedly shook my belief. I
came to see that deciding whether one treatment, under
highly controlled conditions, works better than another treatment
or a control group is a different question from deciding
what works in the field (Munoz, Hollon, McGrath, Rehm, &
VandenBos, 1994). I no longer believe that efficacy studies
are the only, or even the best, way of finding out what
treatments actually work in the field. I have come to believe
that the “effectiveness” study, which examines how patients
fare under the actual conditions of treatment in the field, can
yield useful and credible “empirical validation” of psychotherapy
and medication. This is the method that Consumer Reports
pioneered.

What Efficacy Studies Leave Out 

It is easy to assume that, if some form of treatment is not
listed among the many which have been “empirically validated,”
the treatment must be inert, rather than just “untested”
given the existing method of validation. I will dub
this the inertness assumption. The inertness assumption is a
challenge to practitioners, since long-term dynamic treatment,
family therapy, and more generally, eclectic psychotherapy,
are not on the list of treatments empirically validated
by efficacy studies, and these modalities probably make up
most of what is actually practiced. I want to look closely at
the inertness assumption, since the effectiveness strategy of
empirical validation follows from what is wrong with the
assumption.

The usual argument against the inertness assumption
is that long-term dynamic therapy, family therapy, and eclectic
therapy cannot be tested in efficacy studies, and thus we
have no hard evidence one way or another. They cannot be
tested because they are too cumbersome for the efficacy
study paradigm. Imagine, for example, what a decent efficacy
study of long-term dynamic therapy would require: control
groups receiving no treatment for several years; an equally
credible comparison treatment of the same duration that has
the same “nonspecifics” (rapport, attention, and expectation
of gain) but is actually inert; a step-by-step manual
covering hundreds of sessions; and the random assignment
of patients to treatments which last a year or more. The
ethical and scientific problems of such research are daunting,
to say nothing of how much such a study would cost.

While this argument cannot be gainsaid, it still leaves 
the average psychotherapist in an uncomfortable position, 
with a substantial body of literature validating a panoply of 
short-term therapies the psychotherapist does not perform, 
and with the long-term, eclectic therapy he or she does 
perform unproven. 

But there is a much better argument against the inertness
assumption: The efficacy study is the wrong method for
empirically validating psychotherapy as it is actually done,
because it omits too many crucial elements of what is done
in the field.

The five properties that follow characterize psychotherapy
as it is done in the field. Each of these properties is
absent from an efficacy study done under controlled conditions.
If these properties are important to patients’ getting
better, efficacy studies will underestimate or even miss altogether
the value of psychotherapy done in the field.

1. Psychotherapy (like other health treatments) in the
field is not of fixed duration. It usually keeps going until the
patient is markedly improved or until he or she quits. In
contrast, the intervention in efficacy studies stops after a
limited number of sessions (usually about 12) regardless
of how well or how poorly the patient is doing.

2. Psychotherapy (again, like other health treatments)
in the field is self-correcting. If one technique is not working,
another technique, or even another modality, is usually
tried. In contrast, the intervention in efficacy studies is confined
to a small number of techniques, all within one modality
and manualized to be delivered in a fixed order.

3. Patients in psychotherapy in the field often get there
by active shopping, entering a kind of treatment they actively
sought with a therapist they screened and chose. This
is especially true of patients who work with independent
practitioners, and somewhat less so of patients who go to
outpatient clinics or have managed care. In contrast, patients
enter efficacy studies by the passive process of random
assignment to treatment and acquiescence with who and
what happens to be offered in the study (Howard, Orlinsky,
& Lueger, 1994).

4. Patients in psychotherapy in the field usually have
multiple problems, and psychotherapy is geared to relieving
parallel and interacting difficulties. Patients in efficacy studies
are selected to have but one diagnosis (except when two
conditions are highly comorbid) by a long set of exclusion
and inclusion criteria.

5. Psychotherapy in the field is almost always concerned
with improvement in the general functioning of patients,
as well as amelioration of a disorder and relief of
specific, presenting symptoms. Efficacy studies usually focus
only on specific symptom reduction and whether the
disorder ends.

It is hard to imagine how one could ever do a scientifically
compelling efficacy study of a treatment which had
variable duration and self-correcting improvisations and was
aimed at improved quality of life as well as symptom relief,
with patients who were not randomly assigned and had
multiple problems. But this does not mean that the effectiveness
of treatment so delivered cannot be empirically validated.
Indeed it can, but it requires a different method: a
survey of large numbers of people who have gone through
such treatments. So let us explore the virtues and drawbacks
of a well-done effectiveness study, the Consumer Reports
(1995) one, in contrast to an efficacy study.

Consumer Reports Survey 

Consumer Reports (CR) included a supplementary survey
about psychotherapy and drugs in one version of its 1994
annual questionnaire, along with its customary inquiries
about appliances and services. CR's 180,000 readers received
this version, which included approximately 100 questions
about automobiles and about mental health. CR asked
readers to fill out the mental health section “if at any time
over the past three years you experienced stress or other
emotional problems for which you sought help from any of
the following: friends, relatives, or a member of the clergy; a
mental health professional like a psychologist or a psychiatrist;
your family doctor; or a support group.” Twenty-two
thousand readers responded. Of these, approximately 7,000
subscribers responded to the mental health questions. Of
these 7,000, about 3,000 had just talked to friends, relatives,
or clergy, and 4,100 went to some combination of mental
health professionals, family doctors, and support groups. Of
these 4,100, 2,900 saw a mental health professional:
Psychologists (37%) were the most frequently seen mental health
professional, followed by psychiatrists (22%), social workers
(14%), and marriage counselors (9%). Other mental health
professionals made up 18%. In addition, 1,300 joined self-help
groups, and about 1,000 saw family physicians. The
respondents as a whole were highly educated, predominantly
middle class; about half were women, and the median
age was 46.
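The sample sizes quoted above form a funnel, and the arithmetic can be checked with a short sketch. The counts are the rounded figures from the text; the stage labels and the function name `stage_rates` are illustrative choices, not part of the CR analysis.

```python
# Rounded counts quoted in the text; stage labels are illustrative.
funnel = [
    ("questionnaires sent", 180_000),
    ("questionnaires returned", 22_000),
    ("answered the mental health section", 7_000),
    ("saw professionals, family doctors, or support groups", 4_100),
    ("saw a mental health professional", 2_900),
]


def stage_rates(stages):
    """Return (label, count, share of the previous stage) for each later stage."""
    rates = []
    for (_, previous), (label, count) in zip(stages, stages[1:]):
        rates.append((label, count, count / previous))
    return rates


for label, count, share in stage_rates(funnel):
    print(f"{label}: {count:,} ({share:.1%} of previous stage)")

# Kinds of mental health professional seen, in percent of the 2,900.
professional_shares = {
    "psychologists": 37,
    "psychiatrists": 22,
    "social workers": 14,
    "marriage counselors": 9,
    "other": 18,
}
total_share = sum(professional_shares.values())
```

With these rounded counts the overall return share comes out near 12%, and the five professional categories account for all of the 2,900 respondents who saw a mental health professional.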

Twenty-six questions were asked about mental health 
professionals, and parallel but less detailed questions were 
asked about physicians, medications, and self-help groups: 

* What kind of therapist 

* What presenting problem (e.g., general anxiety, panic,
phobia, depression, low mood, alcohol or drugs, grief, 
weight, eating disorders, marital or sexual problems, 
children or family, work, stress) 

* Emotional state at outset (from very poor to very 
good) 

* Emotional state now (from very poor to very good) 

* Group versus individual therapy 

* Duration and frequency of therapy

* Modality (psychodynamic, behavioral, cognitive, 
feminist) 

* Cost 

* Health care plan and limitations on coverage 

* Therapist competence 

* How much therapy helped (from made things a lot 
better to made things a lot worse) and in what areas 
(specific problem that led to therapy, relations to 
others, productivity, coping with stress, enjoying life 
more, growth and insight, self-esteem and confidence, 
raising low mood) 

* Satisfaction with therapy 

* Reasons for termination (problems resolved or more 
manageable, felt further treatment wouldn't help, 
therapist recommended termination, a new therapist, 
concerns about therapist’s competence, cost, and 
problems with insurance coverage) 

The data set is thus a rich one, probably uniquely rich,
and the data analysis was sophisticated. Because I was
privileged to be a consultant to this study and thus privy to
the entire data set, much of what I now present will be new to
you, even if you have read the CR article carefully. CR's
analysts decided that no single measure of therapy effectiveness
would do and so created a multivariate measure. This
composite had three subscales, consisting of:

1. Specific improvement (“How much did treatment
help with the specific problem that led you to therapy?”
made things a lot better; made things somewhat better;
made no difference; made things somewhat worse; made
things a lot worse; not sure);

2. Satisfaction (“Overall how satisfied were you with
this therapist's treatment of your problems?” completely
satisfied; very satisfied; fairly well satisfied; somewhat
satisfied; very dissatisfied; completely dissatisfied); and

3. Global improvement (how respondents described
their “overall emotional state” at the time of the survey
compared with the start of treatment: “very poor: I barely
managed to deal with things; fairly poor: Life was usually
pretty tough for me; so-so: I had my ups and downs; quite
good: I had no serious complaints; very good: Life was much
the way I liked it to be”).

Each of the three subscales was transformed and
weighted equally on a 0-100 scale, resulting in a 0-300 scale
for effectiveness. The statistical analysis was largely multiple
regression, with initial severity and duration of treatment
(the two biggest effects) partialed out. Stringent levels
of statistical significance were used.
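To make the construction of the 0-300 composite concrete, here is a minimal sketch. The text does not say how each subscale was transformed, so the linear rescaling below, the treatment of global improvement as the change between two emotional-state ratings, and all function and variable names are assumptions made for illustration only; "not sure" answers are simply left out.

```python
# Response options ordered from worst to best (wording taken from the text).
SPECIFIC = [
    "made things a lot worse",
    "made things somewhat worse",
    "made no difference",
    "made things somewhat better",
    "made things a lot better",
]
SATISFACTION = [
    "completely dissatisfied",
    "very dissatisfied",
    "somewhat satisfied",
    "fairly well satisfied",
    "very satisfied",
    "completely satisfied",
]
EMOTIONAL_STATE = ["very poor", "fairly poor", "so-so", "quite good", "very good"]


def rescale(position, levels):
    """Assumed transformation: map an ordinal position linearly onto 0-100."""
    return 100 * position / (levels - 1)


def composite(specific, satisfaction, state_at_start, state_now):
    """Sum three equally weighted 0-100 subscales into a 0-300 score."""
    specific_score = rescale(SPECIFIC.index(specific), len(SPECIFIC))
    satisfaction_score = rescale(SATISFACTION.index(satisfaction), len(SATISFACTION))
    # Assumed: global improvement is the change in emotional state,
    # which ranges from -4 to +4 and is shifted onto 0-100.
    change = EMOTIONAL_STATE.index(state_now) - EMOTIONAL_STATE.index(state_at_start)
    largest_change = len(EMOTIONAL_STATE) - 1
    global_score = 100 * (change + largest_change) / (2 * largest_change)
    return specific_score + satisfaction_score + global_score


best = composite("made things a lot better", "completely satisfied",
                 "very poor", "very good")
worst = composite("made things a lot worse", "completely dissatisfied",
                  "very good", "very poor")
middle = composite("made no difference", "completely satisfied", "so-so", "so-so")
```

Under these assumptions the best possible answers give 300, the worst give 0, and a respondent who was fully satisfied but reported no change gives 200, which shows how much the satisfaction subscale can contribute on its own.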

There were a number of clear-cut results, among them: 

* Treatment by a mental health professional usually
worked. Most respondents got a lot better. Averaged
over all mental health professionals, of the 426
people who were feeling very poor when they began
therapy, 87% were feeling very good, good, or at
least so-so by the time of the survey. Of the 786
people who were feeling fairly poor at the outset,
92% were feeling very good, good, or at least so-so
by the time of the survey. These findings converge
with meta-analyses of efficacy (Lipsey & Wilson,
1993; Shapiro & Shapiro, 1982; Smith, Miller, & Glass,
1980).

* Long-term therapy produced more improvement than
short-term therapy. This result was very robust, and
held up over all statistical models. Figure 1 plots the
overall rating (on the 0-300 scale defined above) of
improvement as a function of length of treatment.
This “dose-response curve” held for patients in both
psychotherapy alone and in psychotherapy plus
medication (see Howard, Kopta, Krause, & Orlinsky,
1986, for parallel dose-response findings for
psychotherapy).

* There was no difference between psychotherapy 
alone and psychotherapy plus medication for any 
disorder (very few respondents reported that they 
had medication with no psychotherapy at all). 

* While all mental health professionals appeared to
help their patients, psychologists, psychiatrists, and
social workers did equally well and better than marriage
counselors. Their patients' overall improvement
scores (0-300 scale) were 220, 226, 225 (not significantly
different from each other), and 208 (significantly
worse than the first three), respectively.

* Family doctors did just as well as mental health
professionals in the short term, but worse in the long
term. Some patients saw both family doctors and
mental health professionals, and those who saw both
had more severe problems. For patients who relied
solely on family doctors, the overall improvement
score when treated for up to six months was 213,
and it remained at that level (212) for those treated
longer than six months. In contrast, the overall
improvement score for patients of mental health
professionals was 211 up to six months, but climbed to
232 when treatment went on for more than six months.
The advantages of long-term treatment by a mental
health professional held not only for the specific
problems that led to treatment, but for a variety of
general functioning scores as well: ability to relate to
others, coping with everyday stress, enjoying life
more, personal growth and understanding, self-esteem
and confidence.

Figure 1

Note. N = 2,346. The 300-point scale is derived from the unweighted sum of responses to three 100-point subscales. The subscales measured specific improvement
(i.e., how much treatment helped with problems that led to therapy), satisfaction with therapist, and global improvement (i.e., how respondents felt at time of survey,
compared with when they began treatment).

* Alcoholics Anonymous (AA) did especially well,
with an average improvement score of 251, significantly
bettering mental health professionals. People
who went to non-AA groups had less severe problems
and did not do as well as those who went to AA
(average score = 215).

* Active shoppers and active clients did better in treatment
than passive recipients (determined by responses
to “Was it mostly your idea to seek therapy?
When choosing this therapist, did you discuss qualifications,
therapist’s experience, discuss frequency,
duration, and cost, speak to someone who was treated
by this therapist, check out other therapists? During
therapy, did you try to be as open as possible, ask for
explanation of diagnosis and unclear terms, do homework,
not cancel sessions often, discuss negative
feelings toward therapist?”).

* No specific modality of psychotherapy did any better
than any other for any problem. These results
confirm the “dodo bird” hypothesis, that all forms of 
psychotherapies do about equally well (Luborsky, 
Singer, & Luborsky, 1975). They come as a rude 
shock to efficacy researchers, since the main theme 
of efficacy studies has been the demonstration of 
the usefulness of specific techniques for specific 
disorders. 


* Respondents whose choice of therapist or duration
of care was limited by their insurance coverage did
worse, as presented in Table 1 (determined by responses
to “Did limitations on your insurance coverage
affect any of the following choices you made?
Type of therapist I chose; How often I met with my
therapist; How long I stayed in therapy”).

These findings are obviously important, and some of
them could not be included in the original CR article because
of space limitations. Some of these findings were quite contrary
to what I expected, but it is not my intention to discuss
their substance here. Rather, I want to explore the methodological
adequacy of this survey. My underlying questions
are “Should we believe the findings?” and “Can the method
be improved to give more authoritative answers?”


Consumer Reports Survey: Methodological Virtues

Sampling. This survey is, as far as I have been able
to determine, the most extensive study of psychotherapy
effectiveness on record. The sample is not representative of
the United States as a whole, but my guess is that it is
roughly representative of the middle class and educated
population who make up the bulk of psychotherapy patients.
It is important that the sample represents people who choose
to go to treatment for their problems, not people who do not
“believe in” psychotherapy or drugs. The CR sample, moreover,
is probably weighted toward “problem solvers,” people
who actively try to do something about what troubles them.

Treatment duration. CR sampled all treatment
durations from one month or less through two years or more. 
Because the study was naturalistic, treatment, it can be 
supposed, continued until the patient (a) was better, (b) gave 
up unimproved, or (c) had his or her coverage run out. This, 
by definition, mirrors what actually happens in the field. In 
contrast to all efficacy studies, which are of fixed treatment 
duration regardless of how the patient is progressing, the CR 
study informs us about treatment effectiveness under the 
duration constraints of actual therapy. 


Table 1

Limitations on Insurance Coverage and Improvement

                                                      Coverage limited         Coverage not limited
Limitations on your                  Percent          Overall   Specific       Overall   Specific
insurance coverage                   checking item*   score     improvement    score     improvement

Type of therapist I chose            20               211       77             224       83
How often I met with my therapist    26               214       79             224       82
How long I stayed in therapy         24               212       78             224       83
Percent of any of the above          43               212       78             226       83

Note. N = 2,900. All differences for the overall scores were statistically significant at p < .01. The same held true for the specific score, except for “How often I met with
my therapist,” which was significant at p < .05. Statistical controls for both severity and duration were applied. Source: Consumer Reports 1994 Annual Questionnaire.
*Multiple responses permitted.
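The size of the coverage effect in Table 1 is easier to see as a gap between the two groups. The sketch below copies the table's scores and subtracts the "coverage limited" values from the "coverage not limited" values; the dictionary layout and the function name `gaps` are illustrative, and the sketch says nothing about significance, which the table note reports separately.

```python
# Scores copied from Table 1, in this order:
# (overall, limited), (specific, limited), (overall, not limited), (specific, not limited)
table1 = {
    "Type of therapist I chose": (211, 77, 224, 83),
    "How often I met with my therapist": (214, 79, 224, 82),
    "How long I stayed in therapy": (212, 78, 224, 83),
    "Any of the above": (212, 78, 226, 83),
}


def gaps(rows):
    """Return (overall gap, specific gap) per item: not limited minus limited."""
    result = {}
    for item, (overall_lim, specific_lim, overall_free, specific_free) in rows.items():
        result[item] = (overall_free - overall_lim, specific_free - specific_lim)
    return result


coverage_gaps = gaps(table1)
for item, (overall_gap, specific_gap) in coverage_gaps.items():
    print(f"{item}: overall +{overall_gap}, specific +{specific_gap}")
```

Every gap is positive, that is, respondents whose coverage was not limited scored higher on both measures for every item.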
Self-correction. Because the CR study was naturalistic,
it informs us of how treatment works as it is actually
performed, without manuals and with self-correction when
a technique falters. This also contrasts favorably to efficacy
studies, which are manualized and not self-correcting when
a given technique or modality fails.

Multiple problems. The large majority of respondents
in the CR study had more than one problem. We can
also assume that a good-sized fraction were “subclinical” in
their problems and would not meet DSM-IV criteria for any
disorder. No patients were discarded because they failed
exclusion criteria or because they fell one symptom short of
a full-blown “disorder.” Thus the sample more closely reflected
people who actually seek treatment than the filtered
and single-disordered patients of efficacy studies.

General functioning. The CR study measured
self-reported changes in productivity at work, interpersonal
relations, well-being, insight, and growth, in addition to
improvement on the presenting problem. Improvement on
the presenting problem is shown in Figure 2; improvement
over work and social domains is shown in Figure 3; and
improvement over personal domains is shown in Figure 4.
Importantly, more improvement on the presenting problem
occurred for treatments which lasted longer than six months.
In addition, more improvement occurred in work, interpersonal
relations, enjoyment of life, and personal growth domains
in treatments which lasted longer than six months.
Since improvement in general functioning, as well as symptom
relief, is almost always a goal of actual treatment but
rarely of efficacy studies, the CR study adds to our knowledge
of how treatment does beyond the mere elimination of
symptoms.

Figure 2

Improvement for Presenting Symptoms

Note. N = 2,738. Percentage of respondents who reported that treatment “made
things a lot better” with respect to the specific problem that led to treatment by
psychiatrists, psychologists, social workers, marriage counselors, or family doctors,
segregated by those treated for more than six months and those treated for less
than six months.

Figure 3

Improvement Over Work and Social Domains

Note. N = 2,738. Mean percentage who reported that treatment “made things a
lot better” with respect to three domains: ability to relate to others, productivity at
work, and coping with everyday stress. Those treated by psychiatrists, psychologists,
social workers, marriage counselors, and physicians are segregated by treatment
for more than six months versus treatment for less than six months.

Figure 4

Improvement Over Personal Domains

Note. N = 2,738. Mean percentage who reported that treatment “made things a
lot better” with respect to four domains: enjoying life more, personal growth and
insight, self-esteem and confidence, and alleviating low moods. Those treated by
psychiatrists, psychologists, social workers, marriage counselors, and physicians
are segregated by treatment for more than six months versus treatment for less than
six months.

Clinical significance. There has been much debate
about how to measure the “clinical significance” of a treatment.
Efficacy studies are designed to detect statistically
significant differences between a treatment and control
groups, and an “effect size” can be computed. But what
degree of statistical significance is clinical significance? How
large an effect size is meaningful? The CR study leaves little
doubt about the human significance of its findings, since
respondents answered directly about how much therapy
helped the problem that led them to treatment, from made
things a lot better to made things a lot worse. Of those who
started out feeling very poor, 54% answered treatment made
things a lot better; and another one third answered it made
things somewhat better.

Unbiased. Finally, it cannot be ignored that CR is
about as unbiased a scrutinizer of goods and services as
exists in the public domain. They have no axe to grind for or
against medications, psychotherapy, managed care, insurance
companies, family doctors, AA, or long-term treatment.
They do not care if psychologists do better or worse than
psychiatrists, marriage and family counselors, or social workers.
They are not pursuing government grants or drug company
favors. They do not accept advertisements. They have
a track record of loyalty only to consumers. So this study
comes with higher credibility than studies that issue from
drug houses, from either APA, from consensus conferences
of the National Institute of Mental Health, or even from the
halls of academe.

In summary, the main methodological virtue of the CR
study is its realism: It assessed the effectiveness of psychotherapy
as it is actually performed in the field with the population
that actually seeks it, and it is the most extensive, carefully
done study to do this. This virtue is akin to the virtues of
naturalistic studies using sophisticated correlational methods,
in contrast to well-controlled, experimental studies. But
because it is not a well-controlled, experimental study like an
efficacy study, the CR study has a number of serious methodological
flaws. Let us examine each of these flaws and ask to
what extent they compromise the CR conclusions.

Consumer Reports Study: Methodological
Flaws and Rebuttals

Sampling. Is there a bias such that those respondents
who succeed in treatment selectively return their
questionnaires? CR, not surprisingly, has gone to considerable
lengths to find out if its readers’ surveys have sampling
bias. The annual questionnaires are lengthy and can run to
100 questions or more. Moreover, the respondents not only
devote a good deal of their own time to filling these out but
also pay their own postage and are not compensated. So the
return rate is rather low absolutely, although the 13% return
rate for this survey was normal for the annual questionnaire.
But it is still possible that respondents might differ systematically
from the readership as a whole. For the mental
health survey (and for their annual questionnaires generally),
CR conducted a “validation survey,” in which postage
was paid and the respondent was compensated. This resulted
in a return rate of 38%, as opposed to the 13%
uncompensated return rate, and there were no differences
between data from the two samples.

The possibility of two other kinds of sampling bias,
however, is notable, particularly with respect to the remarkably
good results for AA. First, since AA encourages lifetime
membership, a preponderance of successes, rather
than dropouts, would be more likely in the three-year time
slice (e.g., “Have you had help in the last three years?”).
Second, AA failures are often completely dysfunctional and
thus much less likely to be reading CR and filling out extensive
readers’ surveys than, say, psychotherapy failures
who were unsuccessfully treated for anxiety.

A similar kind of sampling bias, to a lesser degree, 
cannot be overlooked for other kinds of treatment failures. 
At any rate, it is quite possible that there was a large 
oversampling of successful AA cases and a smaller 
oversampling of successful treatment for problems other 
than alcoholism. 

Could the benefits of long-term treatment be an artifact
of sampling bias? Suppose that people who are doing well in
treatment selectively remain in treatment, and people who
are doing poorly drop out earlier. In other words, the early
dropouts are mostly people who fail to improve, but later
dropouts are mostly people whose problem resolves. CR
disconfirmed this possibility empirically: Respondents reported
not only when they left treatment but why, including
leaving because their problem was resolved. The dropout
rates due to the resolution of the problem were uniform
across duration of treatment (less than one month = 60%;
1-2 months = 66%; 3-6 months = 67%; 7-11 months = 67%;
1-2 years = 67%; over two years = 68%).
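How flat these rates are can be checked directly. The sketch below uses the six percentages quoted above and reports their spread; the variable names are illustrative, and the spread is only a rough descriptive check, not the analysis CR performed.

```python
# Percentage of leavers in each duration band who left because the
# problem was resolved, as quoted in the text.
resolved_share = {
    "less than 1 month": 60,
    "1-2 months": 66,
    "3-6 months": 67,
    "7-11 months": 67,
    "1-2 years": 67,
    "over 2 years": 68,
}

values = list(resolved_share.values())
# Spread across all duration bands, in percentage points.
overall_spread = max(values) - min(values)
# Spread once the shortest band is set aside.
later_values = values[1:]
later_spread = max(later_values) - min(later_values)

print(f"spread across all bands: {overall_spread} points")
print(f"spread beyond the first month: {later_spread} points")
```

If early leavers were mostly failures and late leavers mostly successes, the share leaving because the problem resolved should climb steeply with duration. Instead the whole range spans 8 points, and only 2 points once the first month is set aside.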

A more sweeping limit on generalizability comes from 
the fact that the entire sample chose their treatment. To one 
degree or another, each person believed that psychotherapy 
and/or drugs would help him or her. To one degree or 
another, each person acknowledged that he or she had a 
problem and believed that the particular mental health professional
seen and the particular modality of treatment chosen
would help them. One cannot argue compellingly from 
this survey that treatment by a mental health professional 
would prove as helpful to troubled people who deny their 
problems and who do not believe in and do not choose 
treatment. 

No control groups. The overall improvement rates 
were strikingly high across the entire spectrum of treatments
and disorders in the CR study. The vast majority of 
people who were feeling very poor or fairly poor when they 
entered therapy made “substantial” (now feeling fairly good 
or very good) or “some” (now feeling so-so) gains. Perhaps 
the best news for patients was that those with severe problems
got, on average, much better. While this may be a 
ceiling effect, it is a ceiling effect with teeth. It means that if 
you have a patient with a severe disorder now, the chances 
are quite good that he or she will be much better within three 
years. But methodologically, such high rates of improvement 
are a yellow flag, cautioning us that global improvement over 
time alone, rather than with treatment or medication, may be 
the underlying mechanism. 

More generally, because there are no control groups, 
the CR study cannot tell us directly whether talking to sympathetic
friends or merely letting time pass would have produced
just as much improvement as treatment by a mental 
health professional. The CR survey, unfortunately, did not 
ask those who just talked to friends and clergy to fill out 
detailed questionnaires about the results. 

This is a serious objection, but there are internal controls
which perform many of the functions of control groups. 
First, marriage counselors do significantly worse than psychologists,
psychiatrists, and social workers, in spite of no 
significant differences in kind of problem, severity of problem,
or duration of treatment. Marriage counselors control 
for many of the nonspecifics, such as therapeutic alliance, 
rapport, and attention, as well as for passage of time. Second, 
there is a dose-response curve, with more therapy yielding 
more improvement. The first point in the dose-response 
curve approximates no treatment: people who have less than 
one month of treatment have on average an improvement 
score of 201, whereas people who have over two years of 
treatment have an average score of 241. Third, psychotherapy
does just as well as psychotherapy plus drugs for all 
disorders, and there is such a long history of placebo controls
inferior to these drugs that one can infer that psychotherapy
likely would have outperformed such controls had 
they been run. Fourth, family doctors do significantly worse 
than mental health professionals when treatment continues 
beyond six months. An objection might be made that since 
total length of time in treatment—rather than total amount of 
contact—is the covariate, comparing family doctors who do 
not see their patients weekly with mental health professionals,
who see their patients once a week or more, is not fair. 
It is, of course, possible that if family doctors saw their 
patients as frequently as psychologists do, the two groups 
would do equally well. It was notable, however, that there 
were a significant number of complaints about family doctors:
22% of respondents said their doctor had not “provided 
emotional support”; 15% said their doctor “seemed uncomfortable
discussing emotional issues”; and 18% said their 
doctor was “too busy to spend time talking to me.” At any 
rate, the CR survey shows that long-term family doctoring 
for emotional problems—as it is actually performed in the 
field—is inferior to long-term treatment by a mental health 
professional as it is actually performed in the field. 

It is also relevant that the patients attributed their improvement
to treatment and not time (determined by responses
to “How much do you feel that treatment helped 
you in the following areas?”), and I conclude that the benefits
of treatment are very unlikely to be caused by the mere 
passage of time. But I also conclude that the CR study could 
be improved by control groups whose members are not 
treated by mental health professionals, matched for severity 
and kind of problem (but beware of the fact that random 
assignment will not occur). This would allow the Bayesian 
inference that psychotherapy works better than talking to 
friends, seeing an astrologer, or going to church to be made 
more confidently. 

Self-report. CR’s mental health survey data, as for 
cars and appliances, are self-reported. Improvement, diagnosis,
insurance coverage, even kind of therapist are not verified
by external check. Patients can be wrong about any of 
these, and this is an undeniable flaw. 

But two things can be said in response. First, the noise 
self-reports introduce (inaccuracy about improvement, incorrectness
about the nature of their problem, even inaccuracy
about what kind of a therapist they saw) may be random
rather than systematic, and therefore would not necessarily
bias the study toward the results found. Self-report, in 
principle, can be either rosier or more dire than the report of 
an external observer. Since most respondents are probably 
more emotionally invested in psychotherapy than in their 
automobiles, however, it will take further research to determine
whether the noise introduced by self-report about 
therapy is random or systematic. 

Second, the most important potential inaccuracy produced
by self-report is inaccuracy about respondents’ own 
emotional state before and after treatment, and inaccuracy in 
ratings of improvement in the specific problem, in productivity
at work, and in human relationships. This is, however, an 
ever-present inaccuracy even with an experienced diagnostician,
and the correlations between self-report and diagnosis 
are usually quite high (not surprising, given the common 
method variance). Such self-reports are the blood and guts 
of a clinical diagnosis. But multiple observers are always a 
virtue, and diagnosis by a third party would improve the 
survey method noticeably. 

Blindness. The CR survey is not double-blind, or 
even single-blind. The respondent rates his or her own emotional
state, and knows what treatment he or she had. So it is 
possible that respondents exaggerate the virtues or vices of 
their treatment to comply with or to overthrow their hypotheses
about what CR wants to find. I find this far-fetched: If 
nonblindness compromised readers’ surveys, CR would have 
long ago ceased publishing them, since the readers’ evaluations
of other products and services are always nonblind. 
CR validates its data for goods and services in two ways: 
against manufacturers’ polls and for consistency over time. 
Using both methods, CR has been unable to detect systematic
distortions in its nonblind surveys of goods and services.

Inadequate outcome measures. CR's indexes 
of improvement were molar. Responses like made things a 
lot better to the question “How much did therapy help you 
with the specific problems that led you to therapy?” tap into 
gross processes. More molecular assessment of improvement,
for example, “How often have you cried in the last two 
weeks?” or “How many ounces of alcohol did you have 
yesterday?” would increase the validity of the method. Such 
detail would, of course, make the survey more cumbersome. 

A variant of this objection is that the outcome measures
were insensitive. This objection looms large in light of 
the failure to find that any modality of therapy did better than 
any other modality of therapy for any disorder. Perhaps if 
more detailed, disorder-specific measures were used, the 
dodo bird hypothesis would have been disconfirmed. 

A third variant of this objection is that the outcome 
measures were poorly normed. Questions like “How satisfied 
were you with this therapist’s treatment of your problem? 
Completely satisfied, very satisfied, fairly well satisfied, 
somewhat dissatisfied, very dissatisfied, completely dissatisfied”
and “How would you describe your overall emotional
state? very poor: I barely managed to deal with things; 
fairly poor: Life was usually pretty tough for me; so-so: I had 
my ups and downs; quite good: I had no serious complaints; 
very good: Life was much the way I liked it to be” are seat-of-the-pants
items which depend almost entirely on face validity,
rather than on several generations of norming. So the 
conclusion that 90% of those people who started off very 
poor or fairly poor wound up in the very good, fairly good, 
or so-so categories does not guarantee that they had returned
to normality in any strong psychometric sense. The 
addition of extensively normed questionnaires like the Beck 
Depression Inventory would strengthen the survey method 
(and make it more cumbersome). 

Retrospective. The CR respondents reported retrospectively
on their emotional states. While a one-time survey
is highly cost-effective, it is necessarily retrospective. 
Retrospective reports are less valid than concurrent observation,
although an exception is worth noting: waiting for the 
rosy afterglow of a newly completed therapy to dissipate, as 
the CR study does, may make for a more sober evaluation. The 
retrospective method does not allow for longitudinal observation
of the same individuals for improvement across time. 
Thus the benefits of long-term psychotherapy are inferred 
by comparing different individuals’ improvements cross-sectionally.
A prospective study would allow comparison of 
the same individuals’ improvements over time. 

Retrospective observation is a flaw, but it may introduce
random rather than systematic noise in the study of 
psychotherapy effectiveness. The distortions introduced by 
retrospection could go either in the rosier or more dire direction,
but only further research will tell us if the distortions of 
retrospection are random or systematic. 

It is noteworthy that Consumer Reports generally uses 
two methods. One is the laboratory test, in which, for example,
a car is crashed into a wall at five miles per hour, and 
damage to the bumper is measured. The other is the reader’s 
survey. These two methods parallel the efficacy study and 
the effectiveness study, respectively, in many ways. If retrospection
were a fatal flaw, CR would have given up the 
reader’s survey method long ago, since reliability of used 
cars and satisfaction with airlines, physicians, and insurance 
companies depend on retrospection. Regardless, the survey
method could be markedly improved by being longitudinal,
in the same way as an efficacy study. Self-report and 
diagnosis both could be done before and after therapy, and a 
thorough follow-up carried out as well. But retrospective 
reports of emotional states will always be with us, since even 
in a prospective study that begins with a diagnostic interview,
the patient retrospectively reports on his or her (presumably)
less troubled emotional state before the diagnosis. 

Therapy junkies. Perhaps the important finding 
that long-term therapy does so much better than short-term 
therapy is an artifact of therapy “junkies,” individuals so 
committed to therapy as a way of life that they bias the 
results in this direction. This is possible, but it is not an 
artifact. Those people who spend a long time in therapy may 
well be “true believers.” Indeed, the long-term patients are 
distinct: They have more severe problems initially, are more 
likely to have an emotional disorder, are more likely to get 
medications, are more likely to see a psychiatrist, and are 
more likely to have psychodynamic treatment than the rest of 
the sample. Regardless, they are probably representative of 
the population served by long-term therapy. This population 
reports robust improvement with long-term treatment in the 
specific problem that got them into therapy, as well as in 
growth, insight, confidence, productivity at work, interpersonal
relations, and enjoyment of life. 

Perhaps people who had two or more years of therapy 
are likely still to be in therapy and thus unduly loyal to their 
therapist. They might then be more likely to distort in a rosy 
direction. This seems unlikely, since a comparison of people 
who had over two years of treatment and then ended therapy 
showed the same high improvement scores as those with 
over two years of treatment who were still in therapy (242 and 
245, respectively). 

Nonrandom assignment. The possibility of such 
biases could be reduced by random assignment of patients 
to treatment, but this would undermine the central virtue of 
the CR study: reporting on the effectiveness of psychotherapy
as it is actually done in the field with those patients 
who actually seek it. In fact, the lack of random assignment 
may turn out to be the crucial ingredient in the validity of the 
CR method and a major flaw of the efficacy method. Many 
(but assuredly not all) of the problems that bring consumers 
into therapy have elements of what was called “wanhope” in 
the Middle Ages and is now called “demoralization.” Choice 
and control by a patient, in and of itself, counteracts wanhope 
(Seligman, 1991). 

Random assignment of patients to a modality or to a 
particular therapist not only undercuts the remoralizing effects
of treatment but also undercuts the nonrandom decisions
of therapists in choice of modality for a particular 
patient. Consider, for example, the finding that drugs plus 
psychotherapy did no better than psychotherapy alone for 
any disorder (schizophrenia and bipolar depression were too 
rare for analysis in this sample). The most obvious interpretation
is that drugs are useless and do nothing over and 
above psychotherapy. But the lack of random assignment 
should prevent us from leaping to that conclusion. Assume, 
for the moment, that therapists are canny about who needs 
drugs plus psychotherapy and who can do well with psychotherapy
alone. The therapists assign those patients accordingly
so appropriate patients get appropriate treatment. 
This is just the same logic as a self-correcting trajectory of 
treatment, in which techniques and modalities are modified 
with the patient’s progress. This means that drugs plus 
psychotherapy may actually have done pretty well after all, 
but only in a cannily selected subset of people. 

The upshot of this is that random assignment, the 
prettiest of the methodological niceties in efficacy studies, 
may turn out to be worse than useless for the investigation 
of the actual treatment of mental illness in the field. It is worth 
mulling over what the results of an efficacy or effectiveness 
study might be if half the patients with a particular disorder 
were randomly assigned and were compared with half the 
patients not randomly assigned. Appropriately assigning 
individuals to the right treatment, the right drug, and the 
right sequence of techniques, along with individuals’ choosing
a therapist and a treatment they believe in, may be crucial 
to getting better. 
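
The logic of canny assignment can be made concrete with a toy calculation. The sketch below is purely illustrative: the outcome scores, the two patient types, and the names `OUTCOME` and `group_means` are hypothetical assumptions and are not drawn from the CR data. It supposes that some patients need drugs to do well and others do equally well with psychotherapy alone, then compares group averages under clinician-chosen assignment and under a balanced assignment of the kind randomization aims for.

```python
from statistics import mean

# Hypothetical improvement scores (NOT from the CR survey), keyed by
# (patient needs drugs?, treatment received).
OUTCOME = {
    (True, "therapy"): 50,
    (True, "therapy+drugs"): 80,
    (False, "therapy"): 80,
    (False, "therapy+drugs"): 80,
}


def group_means(assignments):
    """Average outcome per treatment for a list of (needs_drugs, treatment)."""
    by_treatment = {}
    for needs_drugs, treatment in assignments:
        score = OUTCOME[(needs_drugs, treatment)]
        by_treatment.setdefault(treatment, []).append(score)
    return {t: mean(scores) for t, scores in by_treatment.items()}


# Canny clinicians: only the patients who need drugs receive them.
canny = [(True, "therapy+drugs")] * 50 + [(False, "therapy")] * 50

# Balanced assignment: each treatment arm gets an even mix of patient types.
randomized = (
    [(True, "therapy")] * 25 + [(True, "therapy+drugs")] * 25
    + [(False, "therapy")] * 25 + [(False, "therapy+drugs")] * 25
)

print("canny:", group_means(canny))
print("randomized:", group_means(randomized))
```

Under these assumed numbers, the canny groups both average 80, so a field survey would see no difference between psychotherapy alone and psychotherapy plus drugs, even though drugs raise the score of the patients who need them from 50 to 80. Under balanced assignment the psychotherapy-alone arm averages 65 against 80 for the combined arm.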

The Ideal Study 

The CR study, then, is to be taken seriously—not only for its 
results and its credible source, but for its method. It is large-scale;
it samples treatment as it is actually delivered in the 
field; it samples without obvious bias those who seek out 
treatment; it measures multiple outcomes including specific 
improvement and more global gains such as growth, insight, 
productivity, mood, enjoyment of life, and interpersonal relations;
it is statistically stringent and finds clinically meaningful
results. Furthermore, it is highly cost-effective. 

Its major advantage over the efficacy method for studying
the effectiveness of psychotherapy and medications is 
that it captures how and to whom treatment is actually delivered
and toward what end. At the very least, the CR study 
and its underlying survey method provide a powerful addition
to what we know about the effectiveness of psychotherapy
and a pioneering way of finding out more. 

The study is not without flaws, the chief one being the 
limited meaning of its answer to the question “Can psychotherapy
help?” This question has three possible kinds of 
answers. The first is that psychotherapy does better than 
something else, such as talking to friends, going to church, 
or doing nothing at all. Because it lacks comparison groups, 
the CR study only answers this question indirectly. The 
second possible answer is that psychotherapy returns people 
to normality or more liberally to within, say, two standard 
deviations of the average. The CR study, lacking an untroubled
group and lacking measures of how people were 
before they became troubled, does not answer this question. 
The third answer is “Do people have fewer symptoms and a 
better life after therapy than they did before?” This is the 
question that the CR study answers with a clear “yes.” 
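
The second, return-to-normality criterion is the easiest of the three to state precisely. As a minimal sketch, assuming a well-normed scale whose mean and standard deviation in an untroubled population are known (the values below are hypothetical, as is the function name `within_normal_range`), the check is a single comparison:

```python
def within_normal_range(score, norm_mean, norm_sd, k=2.0):
    """True if score lies within k standard deviations of the norm mean."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return abs(score - norm_mean) <= k * norm_sd


# Hypothetical norms for an untroubled comparison group (not CR data).
NORM_MEAN, NORM_SD = 50.0, 10.0

print(within_normal_range(65.0, NORM_MEAN, NORM_SD))  # inside the 2-SD band
print(within_normal_range(75.0, NORM_MEAN, NORM_SD))  # outside the 2-SD band
```

The point of the sketch is what it requires as input: norms from an untroubled group, which the CR survey did not collect.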


The CR study can be improved upon, allowing it to 
speak to all three senses of “psychotherapy works.” These 
improvements would combine several of the best features of 
efficacy studies with the realism of the survey method. First, 
the survey could be done prospectively: A large sample of 
those who seek treatment could be given an assessment 
battery before and after treatment, while still preserving 
progress-contingent treatment duration, self-correction, multiple
problems, and self-selection of treatment. Second, the 
assessment battery could include well-normed questionnaires 
as well as detailed, behavioral information in addition to more 
global improvement information, thus increasing its sensitivity
and allowing it to answer the return-to-normal question. 
Third, blind diagnostic workups could be included, adding 
multiple perspectives to self-report. 

At any rate, Consumer Reports has provided empirical 
validation of the effectiveness of psychotherapy. Prospective
and diagnostically sophisticated surveys, combined with 
the well-normed and detailed assessment used in efficacy 
studies, would bolster this pioneering study. They would be 
expensive, but, in my opinion, very much worth doing. 